Behavior Based Record Linkage

ABSTRACT

A computer implemented method for matching data records from multiple entities comprising providing respective transaction logs for the entities representing actions performed by or in respect of the entities, determining a matching score using the transaction logs for respective pairs of the entities and for predetermined combinations of merged entities by generating a measure representing a gain in behavior recognition for the entities before and after merging, and using the gain as a matching score.

BACKGROUND

Record linkage (RL) is the process of identifying records that refer to the same real world entity. Such records can occur over different data sources (e.g., files, websites, databases, etc.), and can be in different formats across similar sources. A record linkage process can be performed to join or link data sets that do not share a common identifier, such as a database key or URI, and it can be a useful tool when performing data mining tasks. Record linkage analysis based on entity behavior also has many other applications, for example: identifying common customers for stores that are considering a merge; tracking users accessing web sites from different IP addresses; and helping in crime investigations.

A technique which can be used to match data originating from two entities is to measure the similarity between their behaviors. However, typically, a complete knowledge of an entity's behavior is not available to both sources, since each source is only aware of the entity's interaction with that same source. A comparison of entities' behavior will therefore be a comparison of their partial behaviors, which can be misleading and will generally provide less useful information. Moreover, even in the case where both sources have almost complete knowledge about the behavior of a given entity (such as when a customer did all their grocery shopping at one store for one year and then at another store for another year), a similarity strategy may not help, as many entities have very similar behaviors. Accordingly, measuring the similarity can at best group the entities with similar behavior together but will not typically find their unique matches.

SUMMARY

According to an example, there is provided a computer implemented method for matching data records from multiple entities comprising providing respective transaction logs for the entities representing actions performed by or in respect of the entities, determining a matching score using the transaction logs for respective pairs of the entities and for predetermined combinations of merged entities by generating a measure representing a gain in behavior recognition for the entities before and after merging, and using the gain as a matching score.

According to an example, there is further provided a computer program embedded on a non-transitory tangible computer readable storage medium, the computer program including machine readable instructions that, when executed by a processor, implement a method for matching data records from multiple entities comprising providing respective transaction logs for the entities representing actions performed by or in respect of the entities, determining a matching score using the transaction logs for respective pairs of the entities and for predetermined combinations of merged entities by generating a measure representing a gain in behavior recognition for the entities before and after merging, and using the gain as a matching score.

According to an example, there is further provided a method for matching records from multiple sources, comprising determining a coarse match for records using a value representing a period of occurrence of certain actions for an entity including data from a merged pair of sources, and, for a match above a predetermined threshold, using a statistical model to determine a final matching score.

According to an example, there is further provided a method for matching records from multiple sources, comprising providing respective transaction logs for entities representing actions performed by or in respect of the entities, transforming the logs into a predetermined format in order to extract behaviour data for the entities, determining a coarse match for transactions in the logs to provide candidate matches for entities across the logs, determining a fine match for transactions in the logs using a statistical generative model over the candidate matches, and determining matching entities using a score associated with the coarse and/or fine matches.

BRIEF DESCRIPTION OF THE DRAWINGS

An embodiment of the invention will now be described, by way of example only, and with reference to the accompanying drawings, in which:

FIG. 1 is a schematic block diagram of a process for determining data matches between two sets of entities according to an example;

FIG. 2 is a set of tables depicting an example of the pre-processing of raw data;

FIG. 3 is a set of tables depicting an example of determined behavior matrices for a set of entities;

FIG. 4 is a schematic block diagram of a matching method according to an example;

FIG. 5 is a set of charts depicting the action patterns in the complex plane and the effect on the magnitude;

FIG. 6 is a schematic block diagram of an apparatus according to an example;

FIG. 7 is a schematic block diagram of a system according to an example;

FIG. 8 is a schematic block diagram of a method according to an example;

FIG. 9 is a schematic block diagram according to an example; and

FIG. 10 is a schematic block diagram of a method according to an example.

DETAILED DESCRIPTION

According to an example, there is provided a system and method, which can be a computer implemented method, for record linkage or data matching of data records using behavior information stored in transaction logs. The method merges behavior information from each of a candidate pair of entities to be matched. If the two behaviors seem to complete one another, in the sense that stronger behavioral patterns (such as consistent repeated patterns, for example) become detectable after the merge, then this provides a strong indication that the two entities are, in fact, the same. A merge strategy according to an example can handle the case where distinct entities have similar overall behaviors, especially when such behaviors are split across the two sources with different splitting patterns (such as 20%-80% versus 60%-40%, for example). In this case, two behaviors (from first and second sources, for example) will complete each other if they correspond to the same real world entity, and not if they correspond to two distinct entities that happen to share a similar behavior. A matching strategy according to an example can be referred to as a behavior merge strategy, since entities' behaviors are merged and a realized gain is then measured. In an alternative strategy, referred to as a behavior similarity strategy, matching scores can be a measure of the similarity between the two behaviors.

FIG. 1 is a schematic block diagram of a process for determining data matches between two sets of entities according to an example. Given two sets of entities $\{A_1, \ldots, A_{N_1}\}$ and $\{B_1, \ldots, B_{N_2}\}$, where for each entity A (and similarly for B) there exists a transaction log $\{T_1, \ldots, T_{n_A}\}$, a method according to an example returns the most likely matches between entities from the two sets in the form of

$\langle A_i, B_j, S_m(A_i, B_j) \rangle$

where $S_m(A_i, B_j)$ is a matching function. Given entities A, B (and their transactions), the matching function returns a score reflecting the extent to which the transactions of both A and B correspond to the same entity.

A transaction log, from any domain, will typically keep track of certain types of information for each action an entity performs. According to an example, this can include information such as: (1) the time at which the action occurred, (2) the key object upon which the action was performed and (3) additional information describing the object and how the action was performed (e.g., quantity, payment method, etc.). For simplicity, each action will be referred to by its key object herein, as will become apparent below.

For a transaction log $\{T_1, \ldots, T_{n_A}\}$, transaction $T_i$ is a tuple in the form of

$\langle t_i, a, \text{F\_id} \rangle$

in an example, where $t_i$ represents the time of the transaction, a is the action (or event) that took place, and F_id refers to the set of features that describe how action a was performed. There can be similar transaction logs for other entities which include respective sets of tuples in the same form.

In block 101, an initial pre-processing and behavior extraction process is performed. More specifically, raw transaction logs from both sources are transformed into a standard format (described below), and behavior data for each single entity in each log is extracted. Behavior data can be initially represented in a matrix format, for example a “behavior matrix” (BM). The standard format provides a processed log which conforms to a predefined format, which can include additional information for an entity, and which can generalise some data elements of actions in order to provide a more uniform arrangement.

In block 103, a candidate generation phase uses a coarse matching function to generate a first set of candidate matches. When matching a pair of entities, a merge strategy is used, as will be described in more detail below. In this phase, each row in a behavior matrix (BM) can be mapped to a 2-dimensional point, resulting in a compact representation for the behavior with some acceptable information loss. Such a mapping allows for very fast computations to be performed on the behavior data of both the original and the merged entities. In some examples, and depending on the domain knowledge, other techniques can be applied to further discard candidate matches in this phase as desired. For example, two customers in a shopping scenario may be deemed “un-mergeable” if they happened to shop in two different stores at exactly the same time. In an example, coarse matching can be performed by determining patterns in the behavior of entities, including merged entities, and discarding merged entities which exhibit a matching score below a predetermined threshold. Other alternatives are possible, as will become apparent.

In block 105, entity matching is performed. According to an example, accurate matching of the candidate pair of entities (A, B) is achieved by modelling the behavior of entities A, B, and AB using a statistical generative model, where AB is the entity representing the merge of A and B. The estimated model parameters can then be used to compute the matching score. In an example, a finite mixture model can be used, with expectation maximisation used to fit the mixture model for each specific action of an entity and so discover the parameter values which maximize the likelihood function of the observed behavior data.

In addition to the above-mentioned statistical modelling technique, an alternative heuristic technique that is based on information theoretic principles can be used in a matching phase 105. Such an alternative technique can rely on measuring the increase in the level of compressibility as the behavior data of pairs of entities is merged. Such a heuristic technique can be more computationally efficient.

In block 107, final matches are selected by assigning a matching score to each pair of entities and reporting the matches with the highest scores. According to an example, a filtering threshold can be applied to exclude low-scoring matches. To further resolve conflicting matches, other techniques such as stable marriage can be used according to an example.
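The selection step of block 107 can be illustrated with a minimal sketch. It applies a threshold and then resolves conflicts with a simple greedy one-to-one pass, which is a stand-in chosen for brevity and is not the stable marriage technique mentioned above. The function name `select_matches` and the sample scores are hypothetical.

```python
def select_matches(scored_pairs, threshold):
    """Greedy one-to-one selection over (entity_a, entity_b, score) triples.

    Assumption: a greedy pass is used here only for illustration; it is a
    simpler substitute for stable marriage, not an implementation of it.
    """
    chosen, used_a, used_b = [], set(), set()
    # Visit the highest-scoring candidate pairs first.
    for a, b, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if score < threshold:
            break  # every remaining pair scores lower still
        if a not in used_a and b not in used_b:
            chosen.append((a, b, score))
            used_a.add(a)
            used_b.add(b)
    return chosen


# Hypothetical scores for two entities per source.
pairs = [("A1", "B1", 0.9), ("A1", "B2", 0.7), ("A2", "B1", 0.8), ("A2", "B2", 0.2)]
print(select_matches(pairs, threshold=0.5))
```

With the threshold at 0.5 only the pair (A1, B1) survives, because the remaining high-scoring pairs conflict with it and (A2, B2) falls below the threshold.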

FIG. 2 is a set of tables depicting an example of the pre-processing of raw data, such as the pre-processing occurring in block 101 of FIG. 1. The “raw log” in FIG. 2 is a table which includes four columns representing the time an item was bought by a customer, the customer (the entity to be matched), an ID of an item bought by the customer, and a quantity. Typically, since the item name may be too specific to be the key identifier for a customer's buying behavior, an alternative according to an example is to use the item category name as the identifier for the different actions. Accordingly, actions in the sense of FIG. 2 can correspond to buying “Chocolate” and “Cola” rather than the specific product. The main reason behind this generalization is that, for instance, buying one bar of a particular type of chocolate should not be considered as a completely different action from buying a bar of a different type of chocolate, and so on. Typically, such decisions can be made by a domain expert to avoid over-fitting when modelling behavior. In this case, the specific item name, along with the quantity, will be considered as additional detailed information, which can be referred to as the action features.

The next step is to assign an identification, F_id, for each combination of features occurring with a specific action in the raw log, as shown in the “Action Description” table of FIG. 2. This ensures that even if there are multiple features, they can always be reasoned about as a single object using a corresponding F_id. Alternatively, if there is only one feature, then it can be used directly with no need for F_id. Finally, the “Processed Log” table of FIG. 2 is generated by scanning/processing the raw log and registering the time, entity, action, and F_id information for each line or entry.
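The pre-processing described above can be sketched as follows. This is only an illustration under stated assumptions: the item names, the `CATEGORY` mapping (standing in for a domain expert's decisions), and the choice to number F_id values per action starting from 1 are all hypothetical and are not taken from FIG. 2.

```python
from typing import NamedTuple


class RawEntry(NamedTuple):
    time: int
    entity: str
    item: str
    quantity: int


# Hypothetical item-to-category mapping of the kind a domain expert might supply.
CATEGORY = {"choc_bar_1": "Chocolate", "choc_bar_2": "Chocolate", "cola_2l": "Cola"}


def preprocess(raw_log):
    """Return (processed_log, action_description).

    processed_log holds (time, entity, action, f_id) tuples;
    action_description maps (action, features) to the assigned f_id.
    """
    action_description = {}
    next_id = {}  # per-action counter, so each action has its own F_id domain
    processed_log = []
    for e in raw_log:
        action = CATEGORY[e.item]          # generalise the item to its category
        features = (e.item, e.quantity)    # the remaining details become features
        key = (action, features)
        if key not in action_description:
            next_id[action] = next_id.get(action, 0) + 1
            action_description[key] = next_id[action]
        processed_log.append((e.time, e.entity, action, action_description[key]))
    return processed_log, action_description


raw = [
    RawEntry(1, "A", "choc_bar_1", 2),
    RawEntry(3, "A", "choc_bar_2", 4),
    RawEntry(3, "A", "cola_2l", 1),
    RawEntry(5, "A", "choc_bar_1", 2),
]
processed, description = preprocess(raw)
print(processed)
```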

The processed log table of FIG. 2 represents a standardized log for the transactions of the given entities which can be used to determine a BM. That is, the transactions of each entity can be extracted and represented in a matrix of a particular format. Accordingly, given a finite set of n actions performed over m time units by an entity A, the Behavior Matrix (BM) of entity A is, according to an example, an n×m matrix with entries $BM_{i,j}$ such that:

${BM}_{i,j} = \left\{ \begin{matrix}F_{i,j} & {\text{if action } a_{i} \text{ is performed at time } j} \\0 & \text{otherwise}\end{matrix} \right.$

where $F_{i,j} \in F_i$ is the F_id value for the combination of features describing action $a_i$ when performed at time j, and $F_i$ is the domain of all possible F_id values for action $a_i$, with i = 1, . . . , n and j = 1, . . . , m.
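The definition above can be sketched in a few lines. The function name `behavior_matrix`, the 1-based time units, and the sample log are assumptions made for illustration.

```python
def behavior_matrix(processed_log, entity, actions, m):
    """Build an n x m behavior matrix for one entity.

    Rows follow the order of `actions`; column j-1 holds time unit j.
    Assumption: if an action occurs twice in one time unit, the later
    entry overwrites the earlier one.
    """
    row_of = {action: i for i, action in enumerate(actions)}
    bm = [[0] * m for _ in actions]
    for time, ent, action, f_id in processed_log:
        if ent == entity:
            bm[row_of[action]][time - 1] = f_id
    return bm


# Hypothetical processed log of (time, entity, action, f_id) tuples.
log = [
    (1, "A", "Chocolate", 1),
    (3, "A", "Chocolate", 2),
    (3, "A", "Cola", 1),
    (5, "A", "Chocolate", 1),
    (2, "B", "Cola", 1),
]
bm_a = behavior_matrix(log, "A", ["Chocolate", "Cola"], m=5)
print(bm_a)
```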

FIG. 3 is a set of tables depicting an example of determined behavior matrices for a set of entities. More specifically, BMs for entities (such as customers) A, B and C are shown in FIG. 3. A non-zero value indicates that the action was performed, and the value itself is the F_id that links to the description of the action at this time instant. According to an example, a compact representation for the entities' behavior can be derived from the BM representation, and can be constructed and used during an accurate matching phase 105. Such a second compact representation, which is based on the inter-arrival times, considers each row in the BM as a stream or sequence of pairs $\{v_{ij}, F^{(v_{ij})}\}$, where $v_{ij}$ is the inter-arrival time since the last time action $a_i$ occurred, and $F^{(v_{ij})} \in F_i$ is a feature that describes $a_i$ from $L_{a_i}$ possible descriptions, $|F_i| = L_{a_i}$. For example, with reference to FIG. 3, the row corresponding to action $a_i$ = chocolate of entity C, $BM_i = \{0,0,5,0,0,4,0,0,0,5,0,0,0,5,0,4\}$, can be represented as $X_i = \{\{3,5\},\{3,4\},\{4,5\},\{4,5\},\{2,4\}\}$.
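The conversion from a BM row to its inter-arrival representation can be sketched as below. The function name is hypothetical, and the first inter-arrival time is assumed to be measured from the start of the log, which is consistent with the worked example for entity C.

```python
def inter_arrival_sequence(bm_row):
    """Turn one BM row into a list of (inter_arrival_time, f_id) pairs."""
    sequence = []
    last = 0  # 1-based time of the previous occurrence; 0 means start of log
    for time, f_id in enumerate(bm_row, start=1):
        if f_id != 0:
            sequence.append((time - last, f_id))
            last = time
    return sequence


# Chocolate row of entity C as given in the text.
row_c = [0, 0, 5, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 5, 0, 4]
print(inter_arrival_sequence(row_c))
# [(3, 5), (3, 4), (4, 5), (4, 5), (2, 4)]
```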

As described above with reference to FIG. 1, matching entities based on their extracted behavior data is achieved in two consecutive phases according to an example: a candidate generation phase followed by a more accurate matching phase. Ultimately, it is desired to assign a matching score, $S_m$, for each pair of entities (A, B) deemed as a potential match, and then report the matches with the highest scores, such as by reporting the highest scoring matches to a user of a system implementing any of the methods or processes described herein.

FIG. 4 is a schematic block diagram of a matching method according to an example. To compute $S_m(A,B)$, a behavior recognition score, $S_r$, for each entity (i.e., $S_r(A)$ and $S_r(B)$) is computed in blocks 401, 403. The behavior data of both A (405) and B (407) are then merged in block 408 to construct the behavior of some hypothetical entity AB (409), whose score, $S_r(AB)$, is also computed in block 411. According to an example, the BM of a merged entity includes the sum of the F_ids of the individual entities for each respective time period.

In block 413, a check is performed using the computed behavior recognition scores to see if this merge results in a more recognizable behavior compared to either of the two individual behaviors. Hence, the overall matching score depends on a gain achieved for the recognition scores. More specifically:

$\begin{matrix}{{S_{m}\left( {A,B} \right)} = \frac{{n_{A}\left\lbrack {{S_{r}({AB})} - {S_{r}(A)}} \right\rbrack} + {n_{B}\left\lbrack {{S_{r}({AB})} - {S_{r}(B)}} \right\rbrack}}{n_{A} + n_{B}}} & (1)\end{matrix}$

where $n_A$ and $n_B$ are the total number of transactions in the BMs of A and B respectively. Note that the gains corresponding to the two entities are weighted based on the density of their respective BMs.
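A minimal sketch of the merge step and of Eq. 1 follows. The recognition scores are passed in as plain numbers because their computation is described separately. The function names, the sample values, and the choice to count non-zero BM cells as transactions are assumptions for illustration.

```python
def merge_bm(bm_a, bm_b):
    """Merge two behavior matrices by summing F_ids per action and time unit."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(bm_a, bm_b)]


def count_transactions(bm):
    """Assumption: every non-zero cell of a BM counts as one transaction."""
    return sum(1 for row in bm for value in row if value != 0)


def matching_score(sr_a, sr_b, sr_ab, n_a, n_b):
    """Eq. 1: transaction-weighted gain in recognition score after the merge."""
    return (n_a * (sr_ab - sr_a) + n_b * (sr_ab - sr_b)) / (n_a + n_b)


merged = merge_bm([[1, 0, 2]], [[0, 3, 0]])
print(merged)  # [[1, 3, 2]]

# Hypothetical recognition scores: the merged entity is more recognizable.
score = matching_score(sr_a=0.4, sr_b=0.6, sr_ab=0.8, n_a=10, n_b=30)
print(round(score, 6))  # 0.25
```

A positive score means the merged behavior is more recognizable than the weighted individual behaviors, and a negative score means the merge made the behavior harder to recognize.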

To better understand the intuition behind the behavior merge strategy, assume that entities A and C are from a source 1 and B is from a source 2, and that their processed log is shown in the “Processed Log” table of FIG. 2. In order to find the best match for entity B, B's behavior is merged with A's behavior; that is, the BMs for the entities are merged as described above. The same is performed with C, i.e. B's behavior is merged with C's behavior. It is apparent from the resulting merged BMs in FIG. 3 that A is potentially a good match for B; that is, entity AB is likely to be an entity that buys chocolate every 2 or 3 days and prefers to buy 2 liters of cola with either 2 bars of chocolate 1 or 4 bars of chocolate 2, for example. However, it is hard to determine a consistent behavior for entity BC. In a real world scenario, more actions than those described above can be dealt with.

According to an example, a recognition score $S_r$ represents the consistency of an entity's behavior along three main components: (1) consistency in repeating actions, (2) stability in the features describing the action and (3) the association between actions. These three components, which will be described more fully below, are represented by three score components for $S_r$: $S_{r1}$, $S_{r2}$, and $S_{r3}$. In an example, $S_r(A)$ is computed from the geometric mean of these three components according to:

$\begin{matrix}{{S_{r}(A)} = \sqrt[3]{{S_{r1}(A)} \times {S_{r2}(A)} \times {S_{r3}(A)}}} & (2)\end{matrix}$

According to an example, consistency in repeating actions means that entities tend to repeat specific actions on a regular basis following almost consistent inter-arrival times. For example, a user (an entity) of a news web site may check the financial news (an action) every morning (a pattern). Stability in the features describing actions means that when an entity performs an action several times, almost the same features are expected to apply each time. For example, when a customer buys chocolate, s/he may mostly buy either 2 types of one chocolate bar or 1 type of another bar, as opposed to buying a different type of chocolate each time and in completely different quantities. The latter case is unlikely to occur in real scenarios. Association between actions means that actions performed by entities are typically associated with each other, and the association patterns can be detected over time. For example, a customer may be used to buying two particular but otherwise unrelated items together every Sunday afternoon, which implies an association between these two actions.

As mentioned above, a candidate generation phase, such as that described with reference to block 103 of FIG. 1, is used according to an example to avoid examining all possible pairs of entities in a computationally expensive phase dedicated to more accurate matching. A candidate generation phase can quickly determine pairs of entities that are likely to be matched. This phase can result in almost no false negatives, at the expense of relatively low precision.

The high efficiency of this candidate generation phase is primarily due to the use of a compact (yet lossy) behavior representation, which allows for fast computations. In addition, only the first behavior component, i.e., consistency in repeating actions, which is captured by $S_{r1}$, is considered in this phase. Note that because the two other components are ignored, a binarised version of each BM is used according to an example, with 1's replacing non-zero values.

Each row in the BM, which corresponds to an action, is considered as a binary time sequence. For each such sequence, the first element of its Discrete Fourier Transform (DFT) is computed, which is a complex number and hence a 2-dimensional point. The complex number $C_A^{(a_i)}$ corresponding to an action $a_i$ in the BM of an entity A is computed according to:

$\begin{matrix}{C_{A}^{(a_{i})} = {\sum\limits_{j = 0}^{m - 1}{{BM}_{i,j}\, e^{\frac{2\; j\; \pi \sqrt{- 1}}{m}}}}} & (3)\end{matrix}$

According to an example, it is noted that the lower the magnitude of the complex number, the more consistent and regular the time sequence, and vice versa. If each of the elements in the time series is considered as a vector whose magnitude is either 0 or 1, and whose angles are uniformly distributed along the unit circle (i.e., the angle of the $j^{th}$ vector is $\frac{2 j \pi}{m}$), then the complex number will be the resultant of all these vectors. If the time series is consistent in terms of the inter-arrival times between the non-zero values, then their corresponding vectors will be uniformly distributed along the unit circle, and hence they will cancel each other out. Thus, the resultant's magnitude will be closer to zero for a more uniform distribution representing a more consistent inter-arrival time. Also, note that merging the two rows corresponding to an action a in the BMs of two entities A, B reduces to adding two complex numbers, i.e., $C_{AB}^{(a)} = C_A^{(a)} + C_B^{(a)}$.
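The mapping of Eq. 3 and the additive merge can be sketched with the standard library alone. The helper name `first_coeff` and the toy rows are hypothetical. The rows are chosen so that each entity on its own looks irregular while the merged row is perfectly periodic.

```python
import cmath


def first_coeff(binary_row):
    """Eq. 3: resultant of unit vectors at angle 2*pi*t/m for each 1 in the row."""
    m = len(binary_row)
    return sum(x * cmath.exp(2j * cmath.pi * t / m) for t, x in enumerate(binary_row))


row_a = [1, 0, 1, 0, 0, 0, 0, 0]   # occurrences only in the first half
row_b = [0, 0, 0, 0, 1, 0, 1, 0]   # occurrences only in the second half
row_ab = [x + y for x, y in zip(row_a, row_b)]  # every second time unit

c_a, c_b = first_coeff(row_a), first_coeff(row_b)
c_ab = c_a + c_b  # merging rows reduces to adding the complex numbers

print(round(abs(c_a), 3), round(abs(c_b), 3), round(abs(c_ab), 3))
# 1.414 1.414 0.0
```

The two partial rows each have magnitude of about 1.414, while the merged row has magnitude of about 0, which is the drop in magnitude that signals a more regular merged behavior.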

The following example shows how a candidate generation phase can distinguish between “match” and “mismatch” candidates. FIG. 5 is a set of charts depicting the action patterns in the complex plane and the effect on the magnitude. Let $a_A$, $a_B$ and $a_C$ be the rows of action a (chocolate) in the binary BMs of entities A, B and C from FIG. 3. In the chart on the left of FIG. 5, when merging $a_A$ and $a_B$, the magnitude corresponding to the merged action, $a_{AB}$, equals 0.19, which is smaller than the original magnitudes: 1.38 for $a_A$ and 1.53 for $a_B$. The reduction in magnitude is because the sequence $a_{AB}$ is more regular than either of $a_A$ and $a_B$.

In the chart on the right of FIG. 5, the same process is applied for $a_B$ and $a_C$. The magnitudes are 2.03 for $a_{BC}$, 1.54 for $a_B$, and 0.09 for $a_C$. In this case, merging $a_B$ and $a_C$ results in an increase in magnitude because the sequence $a_{BC}$ is less regular than either of $a_B$ and $a_C$. Accordingly, based on the above, a recognition score, $S_r(a_A)$, can be computed for each individual action a that belongs to entity A, which decreases as the magnitude of the complex number $C_A^{(a)}$ increases. In particular, $S_r(a_A) = M - \mathrm{mag}(C_A^{(a)})$, where $\mathrm{mag}(C_A^{(a)})$ is the magnitude of $C_A^{(a)}$ and M is the maximum computed magnitude.

To compute the overall $S_r(A)$, the individual scores, $S_r(a_A)$, are averaged, each weighted by the number of times its respective action was repeated ($n_A^{(a)}$). The formula for $S_r(A)$ is thus given as follows:

$\begin{matrix}{{S_{r}(A)} = {\frac{1}{n_{A}}{\sum\limits_{\forall\; a}{n_{A}^{(a)} \cdot {S_{r}\left( a_{A} \right)}}}}} & (4)\end{matrix}$

After the complex number representation has been computed for each action of an entity, substituting $S_r(a_A) = M - \mathrm{mag}(C_A^{(a)})$, where M is the maximum computed magnitude, into Eq. 4 gives:

${S_{r}(A)} = {\frac{1}{n_{A}}{\sum\limits_{\forall\; a}{n_{A}^{(a)}\left( {M - {{mag}\left( C_{A}^{(a)} \right)}} \right)}}}$

By substituting the above into Eq. 1, the matching score $S_m(A,B)$ can be computed:

${S_{m}\left( {A,B} \right)} = {{\frac{n_{A}}{n_{A} + n_{B}}\begin{bmatrix}{{\frac{1}{n_{A} + n_{B}}{\sum\limits_{\forall\; a}{\left( {n_{A}^{(a)} + n_{B}^{(a)}} \right)\left( {M - {{mag}\left( C_{AB}^{(a)} \right)}} \right)}}} -} \\{\frac{1}{n_{A}}{\sum\limits_{\forall\; a}{\left( n_{A}^{(a)} \right)\left( {M - {{mag}\left( C_{A}^{(a)} \right)}} \right)}}}\end{bmatrix}} + {\frac{n_{B}}{n_{A} + n_{B}}\begin{bmatrix}{{\frac{1}{n_{A} + n_{B}}{\sum\limits_{\forall\; a}{\left( {n_{A}^{(a)} + n_{B}^{(a)}} \right)\left( {M - {{mag}\left( C_{AB}^{(a)} \right)}} \right)}}} -} \\{\frac{1}{n_{B}}{\sum\limits_{\forall\; a}{\left( n_{B}^{(a)} \right)\left( {M - {{mag}\left( C_{B}^{(a)} \right)}} \right)}}}\end{bmatrix}}}$

A simple rearrangement collects the terms related to $\mathrm{mag}(C_{AB}^{(a)})$:

${S_{m}\left( {A,B} \right)} = {\frac{1}{n_{A} + n_{B}}{\sum\limits_{\forall\; a}\left\lbrack {{\left( {n_{A}^{(a)} + n_{B}^{(a)}} \right)M} - {\left( {n_{A}^{(a)} + n_{B}^{(a)}} \right){{mag}\left( C_{AB}^{(a)} \right)}} - {n_{A}^{(a)}M} + {n_{A}^{(a)}{{mag}\left( C_{A}^{(a)} \right)}} - {n_{B}^{(a)}M} + {n_{B}^{(a)}{{mag}\left( C_{B}^{(a)} \right)}}} \right\rbrack}}$

Note that the terms in M cancel out, and the final matching score according to an example is given by:

${S_{m}\left( {A,B} \right)} = {\frac{1}{n_{A} + n_{B}}{\sum\limits_{\forall\; a}\left\lbrack {{n_{A}^{(a)}{{mag}\left( C_{A}^{(a)} \right)}} - {n_{B}^{(a)}{{mag}\left( C_{B}^{(a)} \right)}} - {\left( {n_{A}^{(a)} + n_{B}^{(a)}} \right){{mag}\left( C_{AB}^{(a)} \right)}}} \right\rbrack}}$

In an example, complex number information for each data source can be stored in a relation with the attributes (entity, action, Re, Im, mag, a_supp, e_supp), where there is a tuple for each entity and each of its actions. For each action of an entity, the real and imaginary components (Re and Im) of the complex number as well as the magnitude (mag) can be stored. a_supp is the number of transactions for that action within the entity's log, and e_supp is the total number of transactions for the entity, repeated in each tuple corresponding to one of its actions. Thus, there are two tables, one for each of the two data sources src1 and src2. To generate the candidates, the matching score given above is computed for each pair of entities and the results are filtered using a threshold t on the resulting matching score, such that only matches above the threshold are passed.

According to an example, accurate matching following candidate generation is performed using a statistical model for the behavior of an entity given its observed actions. The two key variables defining an entity's behavior with respect to a specific action are (1) the inter-arrival time between the action occurrences, and (2) the feature id (F_id) associated with each occurrence, which represents the features describing how the action was performed at that time; in other words, it reflects the entity's preferences when performing this action.

Typically, an entity will be biased to a narrow set of inter-arrival times and feature ids, which is what will distinguish the entity's behavior. In merging two behavior matrices for the same entity, the bias should generally be reinforced. However, when the behavior matrices of two different entities are merged, the bias will instead typically be weakened and harder to recognize.

A system and method according to an example uses a generated model for the behavior of an entity to determine motifs or patterns by separating them from some background sequence that is random in nature. In the present case, a motif can correspond to a sequence of an action by the same entity. A model according to an example accommodates two requirements: (a) sequences consist of two variables (inter-arrival time and feature id), and (b) for ordinal variables (such as the inter-arrival time), neighboring values need to be treated similarly.

In an example, the behavior of an entity A with respect to a specific action a can be modeled using a finite mixture model $M = \{M_1, \ldots, M_K\}$, with mixing coefficients $\lambda^{(a_A)} = \{\lambda_1^{(a_A)}, \ldots, \lambda_K^{(a_A)}\}$, where $M_k$ is its $k^{th}$ component. Each component $M_k$ is associated with two random variables: (i) the inter-arrival time, which is generated from a uniform distribution over a range of inter-arrival times, $r_k = [start_k, end_k]$; and (ii) the feature id, which is a discrete variable modeled using a multinomial distribution with parameter $\theta_k^{(a_A)} = \{f_{k1}^{(a_A)}, \ldots, f_{kL}^{(a_A)}\}$, where L is the number of all possible feature ids, and $f_{kj}^{(a_A)}$ is the probability of describing the occurrence of action a using feature $F_j$, j = 1, . . . , L. The range size of $r_k$ is user-configurable and can depend on the application. For example, ranges can be generated by sliding, over the time period, a window of size 5 days with a step of 3 days (i.e. {{1,6},{4,9},{7,12}, . . . }). Other alternatives are possible.
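The sliding-window construction of the ranges can be sketched as follows. The function name and the choice to stop once a window would extend past the last time unit are assumptions. A window of size 5 is taken to span s to s+5, as in the listed ranges.

```python
def sliding_ranges(first, last, size=5, step=3):
    """Ranges (s, s + size) for s = first, first + step, ... that end by `last`."""
    return [(s, s + size) for s in range(first, last - size + 1, step)]


print(sliding_ranges(1, 12))
# [(1, 6), (4, 9), (7, 12)]
```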

For the sake of clarity, the superscript $a_A$ is omitted hereinafter, and it is assumed that there is only one action in the system so as to simplify the notation. The model for an entity can be considered to be a generative model in the sense that, once built, it can be used to generate new action occurrences for the entity. For example, using λ, the component $M_k$ to generate the next action occurrence can be selected, and this occurrence should take place after an inter-arrival time picked from the corresponding range $r_k = [start_k, end_k]$. The action can be described by selecting a feature id using $\theta_k$. According to an example, the estimated parameters of the model (λ and the vectors $\theta_k$) are used to determine a measure representing the level at which repeated patterns in a sequence corresponding to action occurrences are recognized.
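The generative reading of the model can be sketched as a three-step draw: pick a component using λ, pick an inter-arrival time uniformly from its range, and pick a feature id using the component's multinomial parameters. The function name, the integer-valued inter-arrival times, and the parameter values below are assumptions for illustration.

```python
import random


def generate_occurrence(lam, ranges, theta, rng=random):
    """Draw one (inter_arrival_time, feature_id) pair from the mixture model.

    lam:    mixing coefficients, one per component
    ranges: (start, end) inter-arrival range per component
    theta:  per-component dict mapping feature id -> probability
    """
    k = rng.choices(range(len(lam)), weights=lam)[0]   # choose component M_k
    start, end = ranges[k]
    v = rng.randint(start, end)                        # uniform over [start, end]
    features = list(theta[k])
    f = rng.choices(features, weights=[theta[k][x] for x in features])[0]
    return v, f


# Hypothetical parameters: short gaps with small quantities, long gaps with large.
lam = [3 / 7, 4 / 7]
ranges = [(6, 8), (13, 15)]
theta = [{"s": 1.0, "l": 0.0}, {"s": 0.0, "l": 1.0}]

random.seed(0)
samples = [generate_occurrence(lam, ranges, theta) for _ in range(200)]
```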

For example, consider that a customer's behavior with respect to the action of buying chocolate is represented by the sequence {{6,s},{15,l},{6,s},{8,s},{15,l},{14,l},{13,l}}, where s denotes a small quantity (e.g., 1-5 bars), and l denotes a large quantity (e.g., more than 5 bars). A small quantity of chocolate was bought after 6 days, and a large quantity after 15 days, and so on. To characterize the inter-arrival times preferred by this customer, the best ranges of size 2 to use are [6,8] and [13,15]. Their associated mixing coefficients ($\lambda_k$) should be 3/7 and 4/7, because the two ranges cover 3 and 4 respectively out of the 7 observed data points. However, since in general the best ranges in a behavior sequence will not be as clear as in this case, all the ranges of a given size (2 in this case) are considered, and mixing coefficients are assigned to each of them. The possible ranges in this example would therefore be {[6,8],[7,9],[8,10], . . . ,[13,15]}.

An approach to compute $\lambda_k$ for each range is to compute the normalized frequency of occurrence of the given range for all the observed data points. For example, the normalized frequencies for the ranges [6,8], [12,14], and [13,15] are 3/12, 2/12, and 4/12 (or ¼, ⅙, and ⅓) respectively, where 12 is the sum of frequencies for all possible ranges. Note that the same inter-arrival time may fall in multiple overlapping ranges. Clearly, these are not the desired values for $\lambda_k$. It is desired to have zero values for all ranges other than [6,8] and [13,15]. However, these normalized frequencies can still be used as the initial values for $\lambda_k$ to be fed into an expectation maximization algorithm as will be described below.
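The initial mixing coefficients for the chocolate example can be reproduced with a short sketch. The function name is hypothetical, and ranges are assumed to be inclusive at both ends and to slide by one time unit.

```python
from fractions import Fraction


def initial_mixing(inter_arrivals, size=2):
    """Normalized frequency of each range (s, s + size) over the observed gaps."""
    lo, hi = min(inter_arrivals), max(inter_arrivals)
    ranges = [(s, s + size) for s in range(lo, hi - size + 1)]
    # An inter-arrival time may be counted in several overlapping ranges.
    counts = {r: sum(r[0] <= v <= r[1] for v in inter_arrivals) for r in ranges}
    total = sum(counts.values())
    return {r: Fraction(c, total) for r, c in counts.items()}


gaps = [6, 15, 6, 8, 15, 14, 13]
lam0 = initial_mixing(gaps)
print(lam0[(6, 8)], lam0[(12, 14)], lam0[(13, 15)])
# 1/4 1/6 1/3
```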

Similarly, to compute the initial values for the θ_(k) probabilities, the data points covered by the range corresponding to component M_(k) are considered. Then, for each possible value of the feature id, its normalized frequency across these data points is computed. In the example above, the customer favors buying small quantities when s/he shops at short intervals (6-8 days apart), and large quantities when s/he shops at longer intervals (13-15 days apart).
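
The initial feature probabilities for a single range can be sketched in the same way. The function name `initial_theta` is hypothetical, and returning zeros for a range that covers no data points is an assumption of this sketch.

```python
from collections import Counter

def initial_theta(sequence, rng, features):
    """Normalized frequency of each feature id among the points covered by rng."""
    start, end = rng
    covered = [f for v, f in sequence if start <= v <= end]
    counts = Counter(covered)
    n = len(covered)
    # A range covering no data yields all zeros (assumption of this sketch).
    return {f: (counts[f] / n if n else 0.0) for f in features}

sequence = [(6, "s"), (15, "l"), (6, "s"), (8, "s"), (15, "l"), (14, "l"), (13, "l")]
theta_short = initial_theta(sequence, (6, 8), ["s", "l"])    # {"s": 1.0, "l": 0.0}
theta_long = initial_theta(sequence, (13, 15), ["s", "l"])   # {"s": 0.0, "l": 1.0}
```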

According to an example, and as mentioned above, an Expectation Maximization (EM) algorithm is used in order to fit the mixture model for each specific action a of an entity A to discover the optimal parameter values which maximize the likelihood function of the observed behavior data. To simplify the notation, it is assumed that there is only one action in the system, and so the superscript that links the entity and action names is omitted for the sake of clarity.

As described, a model consists of K components M={M₁, . . . , M_(K)}, where each component M_(k) describes the occurrence of an action using two variables: r_(k)=[start,end] with a uniform distribution, which represents a range of inter-arrival times of the action, and θ_(k), which represents an independent random variable describing a multinomial trial with parameters θ_(k)={f_(k1), . . . , f_(kL)}, where L is the number of possible features to describe the action when it occurs. f_(kj) is the probability of describing an action using feature F_(j) in component M_(k). The different parameters f_(kj), with k={1, . . . , K} and j={1, . . . , L}, are estimated from the entity's transaction log. The overall model of the pattern is achieved by estimating the components' mixing coefficients λ={λ₁, . . . , λ_(K)}. λ_(k), with Σ_(k=1)^(K)λ_(k)=1, is the probability of using component M_(k) to get the next entry {v,F^((v))} in the sequence X; i.e., after how many time units, v, the action will occur and how it will be described, F^((v)). Accordingly, the parameters for the overall model of an action are the mixing coefficients λ and the vector θ_(k) for each component M_(k), where k={1, . . . , K}.

According to an example, the EM uses the concept of missing data and follows an iterative procedure to find values for λ and θ which maximize the likelihood of the data given the model. The missing data is the knowledge of which components produced X={{v₁,F^((v₁))}, . . . , {v_(N),F^((v_(N)))}}. A finite mixture model assumes that the sequence X arises from two or more components with different, unknown parameters. Once these parameters are obtained, they are used to compute the behavior scores along each of the three behavior components.

In an example, a K-dimensional binary random variable Z is introduced with a 1-of-K representation in which a particular z_(k) is equal to 1 and all other elements are equal to 0, i.e., z_(k)∈{0,1} and Σ_(k=1)^(K)z_(k)=1, such that the probability p(z_(k)=1)=λ_(k). Every entry X_(i) in the sequence X is assigned Z_(i)={z_(i1), z_(i2), . . . , z_(iK)}. Accordingly, the probability of X_(i) is:

$\begin{matrix}{{p\left( {\left. X_{i} \middle| \theta_{1} \right.,\ldots \mspace{14mu},\theta_{K}} \right)} = {\sum\limits_{k = 1}^{K}{{p\left( {z_{ik} = 1} \right)}{p\left( {\left. X_{i} \middle| Z_{i} \right.,\theta_{1},\ldots \mspace{14mu},\theta_{K}} \right)}}}} \\{= {\sum\limits_{k = 1}^{K}{\lambda_{k}{p\left( X_{i} \middle| \theta_{k} \right)}}}}\end{matrix}$

Since z_(ik) is not known, the conditional probability γ(z_(ik)) of z_(ik) given X_(i) is considered to be p(z_(ik)=1|X_(i)), which can be found using Bayes' theorem:

$\begin{matrix}\begin{matrix}{{\gamma \left( z_{ik} \right)} = \frac{{p\left( {z_{ik} = 1} \right)}{p\left( {\left. X_{i} \middle| z_{ik} \right. = 1} \right)}}{\sum\limits_{k' = 1}^{K}{{p\left( {z_{ik'} = 1} \right)}{p\left( {\left. X_{i} \middle| z_{ik'} \right. = 1} \right)}}}} \\{= \frac{\lambda_{k}{p\left( X_{i} \middle| \theta_{k} \right)}}{\sum\limits_{k' = 1}^{K}{\lambda_{k'}{p\left( X_{i} \middle| \theta_{k'} \right)}}}}\end{matrix} & (4)\end{matrix}$

The λ_(k) is viewed as the prior probability of z_(ik)=1, and γ(z_(ik)) as the corresponding posterior probability once X is obtained. γ(z_(ik)) can also be viewed as the ‘responsibility’ that component M_(k) takes for explaining the observation X_(i). Therefore, the likelihood or probability of the data given the parameters can be written in the log form as:

$\begin{matrix}\begin{matrix}{{\ln \; {p\left( {\left. X \middle| \lambda \right.,\theta} \right)}} = {\sum\limits_{i = 1}^{N}{\sum\limits_{k = 1}^{K}{{\gamma \left( z_{ik} \right)}{\ln \left\lbrack {\lambda_{k}{p\left( X_{i} \middle| \theta_{k} \right)}} \right\rbrack}}}}} \\{= {{\sum\limits_{i = 1}^{N}{\sum\limits_{k = 1}^{K}{{\gamma \left( z_{ik} \right)}\ln \; {p\left( X_{i} \middle| \theta_{k} \right)}}}} +}} \\{{\sum\limits_{i = 1}^{N}{\sum\limits_{k = 1}^{K}{{\gamma \left( z_{ik} \right)}\ln \; \lambda_{k}}}}}\end{matrix} & (5)\end{matrix}$

According to an example, the EM algorithm monotonically increases the log likelihood of the data until convergence by iteratively computing the expected log likelihood of the complete data (X,Z) in the E-step and maximizing this expected log likelihood over the model parameters λ and θ in the M-step. Some initial values for the parameters λ⁽⁰⁾ and θ⁽⁰⁾ are chosen, and then the E-step and M-step of the algorithm are alternated until convergence. In the E-step, to compute the expected log likelihood of the complete data, the required conditional distribution γ⁽⁰⁾(z_(ik)) is computed. The values λ⁽⁰⁾ and θ⁽⁰⁾ are used with the Bayes' theorem expression for γ(z_(ik)) above to compute γ⁽⁰⁾(z_(ik)). p(X_(i)|θ_(k)) can be computed as follows:

$\begin{matrix}{{p\left( X_{i} \middle| \theta_{k} \right)} = {\prod\limits_{j = 1}^{L}\; f_{kj}^{I{({j,k,F^{(v_{i})}})}}}} & (6)\end{matrix}$

where X_(i)={v_(i),F^((v_(i)))} and I(j,k,F^((v_(i)))) is an indicator function equal to 1 if v_(i)∈r_(k) and F^((v_(i)))=F_(j); otherwise it is 0. Recall that r_(k)=[start,end] is the period identifying the component M_(k).

The M-step of the EM process maximizes equation 5 over λ and θ in order to re-estimate new values for them: λ⁽¹⁾ and θ⁽¹⁾. The maximization over λ involves only the second term in equation 5, and argmax_(λ)Σ_(i=1)^(N)Σ_(k=1)^(K)γ(z_(ik))ln λ_(k), subject to Σ_(k=1)^(K)λ_(k)=1, has the solution

$\begin{matrix}{{\lambda_{k}^{(1)} = {\frac{1}{N}{\sum\limits_{i = 1}^{N}{\gamma^{(0)}\left( z_{ik} \right)}}}},{k = 1},\ldots \mspace{14mu},{K.}} & (7)\end{matrix}$

In order to maximize over θ, the first term in equation 5 is maximized separately over each θ_(k) for k={1, . . . , K}. Accordingly, argmax_(θ)E[log p(X,Z|θ₁, . . . , θ_(K))] is equivalent to maximizing the right hand side of equation 8 over θ_(k) (only a piece of the parameter) for every k:

$\begin{matrix}{{\theta_{k} = {\arg \; {\max_{\theta_{k}}{\sum\limits_{i = 1}^{N}{{\gamma^{(0)}\left( z_{ik} \right)}\ln \; {p\left( X_{i} \middle| \theta_{k} \right)}}}}}},} & (8)\end{matrix}$

To do this, for k={1, . . . , K} and j={1, . . . , L} let

$\begin{matrix}{c_{kj} = {\sum\limits_{i = 1}^{N}{{\gamma^{(0)}\left( z_{ik} \right)}{I\left( {j,k,F^{(v_{i})}} \right)}}}} & (9)\end{matrix}$

Then c_(kj) is in fact the expected number of times the action is described by F_(j) when its inter-arrival time falls in M_(k)'s range r_(k). The θ_(k) can be re-estimated by substituting equation 6 into equation 8 to provide:

$\begin{matrix}{\theta_{k}^{(1)} = {\left\{ {{\hat{f}}_{k\; 1},\ldots \mspace{14mu},{\hat{f}}_{kL}} \right\} = {\arg \; {\max_{\theta_{k}}{\sum\limits_{j = 1}^{L}{c_{kj}\ln \; f_{kj}}}}}}} & (10)\end{matrix}$

Therefore:

$\begin{matrix}{{\hat{f}}_{kj} = \frac{c_{kj}}{\sum\limits_{j = 1}^{L}c_{kj}}} & (11)\end{matrix}$

To find the initial parameters λ⁽⁰⁾ and θ⁽⁰⁾, the sequence X can be scanned once and equation 9 can be used to determine the c_(kj) by setting all γ⁽⁰⁾(z_(ik))=1. Following this, equation 11 can be used to compute θ_(k)⁽⁰⁾ and:

$\lambda_{k}^{(0)} = \frac{\sum\limits_{j = 1}^{L}c_{kj}}{\sum\limits_{k = 1}^{K}{\sum\limits_{j = 1}^{L}c_{kj}}}$
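
By way of a non-limiting illustration, the initialization and the alternating E-step and M-step can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that p(X_(i)|θ_(k)) equals f_(kj) when v_(i) falls in r_(k) and is zero otherwise, that all ranges have the same width so the uniform density cancels, and that every observation is covered by at least one component with non-zero probability.

```python
def init_params(seq, ranges, features):
    """Initial lambda and theta from a single scan with all gamma set to 1."""
    idx = {f: j for j, f in enumerate(features)}
    c = [[0.0] * len(features) for _ in ranges]
    for v, f in seq:
        for k, (s, e) in enumerate(ranges):
            if s <= v <= e:
                c[k][idx[f]] += 1.0
    total = sum(map(sum, c))
    lam = [sum(row) / total for row in c]
    # Components covering no data get a uniform theta (assumption; their lambda is 0).
    theta = [[x / sum(row) for x in row] if sum(row)
             else [1.0 / len(features)] * len(features) for row in c]
    return lam, theta

def em_step(seq, ranges, features, lam, theta):
    """One E-step followed by one M-step; returns updated (lam, theta)."""
    idx = {f: j for j, f in enumerate(features)}
    n, K, L = len(seq), len(ranges), len(features)
    # E-step: responsibilities gamma[i][k] via Bayes' theorem.
    gamma = []
    for v, f in seq:
        w = [lam[k] * theta[k][idx[f]] if s <= v <= e else 0.0
             for k, (s, e) in enumerate(ranges)]
        z = sum(w)  # assumed > 0: some component explains every observation
        gamma.append([x / z for x in w])
    # M-step: lambda_k is the mean responsibility of component k.
    new_lam = [sum(g[k] for g in gamma) / n for k in range(K)]
    new_theta = []
    for k, (s, e) in enumerate(ranges):
        # c[j] is the expected count of feature F_j within range r_k.
        c = [0.0] * L
        for (v, f), g in zip(seq, gamma):
            if s <= v <= e:
                c[idx[f]] += g[k]
        tot = sum(c)
        # Keep the previous theta_k if the component has no remaining weight.
        new_theta.append([x / tot for x in c] if tot > 0 else theta[k])
    return new_lam, new_theta

def fit(seq, ranges, features, iterations=100):
    """Run a fixed number of EM iterations (no convergence test in this sketch)."""
    lam, theta = init_params(seq, ranges, features)
    for _ in range(iterations):
        lam, theta = em_step(seq, ranges, features, lam, theta)
    return lam, theta

seq = [(6, "s"), (15, "l"), (6, "s"), (8, "s"), (15, "l"), (14, "l"), (13, "l")]
ranges = [(s, s + 2) for s in range(6, 14)]  # [6,8], [7,9], ..., [13,15]
lam, theta = fit(seq, ranges, ["s", "l"])
```

For the chocolate sequence, the mixing coefficients of [6,8] and [13,15] approach 3/7 and 4/7 while the remaining coefficients approach zero, which matches the desired values discussed earlier. A fixed iteration count is used here for brevity; a practical implementation would more likely stop when the log likelihood stops improving.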

As described, to match two entities A and B, the gain S_(m)(A,B) in recognizing a behavior after merging A and B is computed using equation 1. This requires computing the scores S_(r)(A), S_(r)(B) and S_(r)(AB) using equation 2, which in turn requires computing the behavior recognition scores corresponding to the three behavior components, which, for entity A for example, are S_(r1)(A), S_(r2)(A), and S_(r3)(A).

As mentioned, for the first behavior component, the consistency in repeating an action a is equivalent to classifying its sequence as a motif, and the pattern strength can be quantified to be inversely proportional to the uncertainty about selecting a model component using λ^((a_A)). That is to say, action a's sequence is a motif if the uncertainty about λ^((a_A)) is low. Thus, entropy can be used to compute S_(r1)(a_(A))=log K−H(λ^((a_A))), where H(λ^((a_A)))=−Σ_(k=1)^(K)λ_(k)^((a_A)) log λ_(k)^((a_A)), and the overall score S_(r1)(A) is then computed by a weighted sum over all the actions according to their support, i.e., the number of times the action was repeated:

$\begin{matrix}{{S_{r\; 1}(A)} = {\frac{1}{n_{A}}{\sum\limits_{\forall\; a}{n_{A}^{(a)} \cdot {S_{r\; 1}\left( a_{A} \right)}}}}} & (15)\end{matrix}$

For the second behavior component, the stability in describing the action (action features) is more recognizable when the uncertainty in picking the feature id values is low. According to an example, the behavior score along this component can be evaluated by first computing θ′^((a_A))={f₁′^((a_A)), . . . , f_(L)′^((a_A))}, which is the overall parameter to pick a feature id value for action a using the multinomial distribution, such that the overall probability for entity A to describe its action a by feature F_(j) is f_(j)′^((a_A)). Here, f_(j)′^((a_A))=Σ_(k=1)^(K)λ_(k)^((a_A))f_(kj)^((a_A)) is combined from all K components for j=1, . . . , L, knowing that θ_(k)^((a_A))={f_(k1)^((a_A)), . . . , f_(kL)^((a_A))}. Using the entropy of θ′^((a_A)), S_(r2)(a_(A))=log L−H(θ′^((a_A))) is computed, where H(θ′^((a_A)))=−Σ_(j=1)^(L)f_(j)′^((a_A)) log f_(j)′^((a_A)). The overall score S_(r2)(A) can be computed as the weighted sum of S_(r2)(a_(A)) according to the actions' support.
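
The per-action score for the second component can be sketched in the same manner. The function names are hypothetical, and θ is assumed to be given as a K-by-L list of lists aligned with λ.

```python
import math

def overall_feature_probs(lam, theta):
    """f'_j = sum over k of lambda_k * f_kj, for j = 1..L."""
    L = len(theta[0])
    return [sum(l * row[j] for l, row in zip(lam, theta)) for j in range(L)]

def s_r2_action(lam, theta):
    """log L - H(theta') for one action."""
    f = overall_feature_probs(lam, theta)
    h = -sum(x * math.log(x) for x in f if x > 0)
    return math.log(len(f)) - h

mixed = s_r2_action([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])   # features evenly split
stable = s_r2_action([0.5, 0.5], [[1.0, 0.0], [1.0, 0.0]])  # always the same feature
```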

For the third component, evidence about the associations between actions is determined. For every pair of actions, the probability of the pair being generated from components with the same inter-arrival ranges is estimated. The association between actions can be recognized when they occur close to each other. In other words, this can occur when both of them tend to prefer the same model components to generate their sequences. For example, the score for the third component can be computed over all possible pairs of actions for the same entity as follows:

${S_{r\; 3}(A)} = {\sum\limits_{{\forall\; a},b}{\sum\limits_{k = 1}^{K}{\lambda_{k}^{(a_{A})}\lambda_{k}^{(b_{A})}}}}$

According to an example, the similarity between two behaviors can be quantified by the closeness between the parameters of their corresponding behavior models computed using the Euclidean distance. For two entities A and B, a behavior similarity BSim(A,B) can therefore be computed as:

${{BSim}\left( {A,B} \right)} = {1 - {\frac{1}{n_{A} + n_{B}}{\sum\limits_{\forall\; a}\left( {n_{A}^{a} + n_{B}^{a}} \right)}}}$$\sqrt{\sum\limits_{k = 1}^{K}\left\lbrack {\left( {\lambda_{k}^{(a_{A})} - \lambda_{k}^{(a_{B})}} \right)^{2} + {\sum\limits_{j = 1}^{L}\left( {{\lambda_{k}^{(a_{A})}f_{kj}^{(a_{A})}} - {\lambda_{k}^{(a_{B})}f_{kj}^{(a_{B})}}} \right)^{2}}} \right\rbrack}.$

In an example, this method may be preferred over directly comparing the BMs of the entities, since the latter method would require alignment for the time dimension of the BMs. In particular, deciding which cells to compare to which cells may not be obvious.

According to an example, an information theory-based technique for the computation of the matching scores can also be used. Although such a technique will typically not be as accurate as the technique described above, it can be more computationally efficient. The underlying idea stems from observing that if a BM is represented as an image, there will be horizontal repeated blocks that would be more recognizable if the behavior is well recognized. The repeated blocks appear because of the repetition in the behavior patterns. Therefore, more regularity along the rows than along the columns of the BM can be expected. In fact, the order of values in any of the columns depends on the order of the actions in the BM, which is not expected to follow any recognizable patterns. For these reasons, the BM can be compressed on a row by row basis, rather than compressing the entire matrix as a whole.

Typically, compression techniques exploit data repetition and encode it in a more compact representation. According to an example, compressibility can be used as a measure of confidence in recognizing behaviors. For example, a BM can be compressed using the DCT compression technique, being one of the most commonly used compression techniques in practice. The compression ratios can then be used to compute the behavior recognition scores. Significantly higher compression ratios imply a more recognizable behavior.

Given the sequence representation of an action occurrence, i.e., {{v_(j),F^((v_(j)))}}, if an entity follows stability in repeating an action, the values v_(j) will follow a certain level of correlation showing the action rate. Moreover, the feature values F^((v_(j))) will contain similar values to describe how the action was performed.

An aim is to compute the three behavior recognition scores along the three behavior components described above. For the first behavior component, the sequence

$\{ v_{1},\ldots,v_{n_{A}^{(a)}} \}$

can be compressed, which represents the inter-arrival times for each action a. The behavior score, S_(r1)(a_(A)) for action a of entity A, will then be the resultant compression ratio; the higher the compression ratio, the more a consistent inter-arrival time (motif) can be recognized. The equation for the final matching score can be used to compute the overall score S_(r1)(A). Similarly, for the second behavior component, the sequence

$\{ F^{(v_{1})},\ldots,F^{(v_{n_{A}^{(a)}})} \}$

can be compressed, which represents the feature values that describe the action a. Again, the score S_(r2)(a_(A)) is the generated compression ratio; the higher the compression ratio, the more that stability in action features can be recognized. Similarly to S_(r1)(A), the overall score S_(r2)(A) can be computed.

Finally, for the third behavior component, which evaluates the relationship between the actions, the concatenated sequences of inter-arrival times of every possible pair of actions can be compressed. Given two actions a and b, their inter-arrival time sequences are concatenated and then compressed to arrive at the compression ratio cr_(a,b). If a and b are closely related, they will have similar inter-arrival times, allowing for better compressibility of the concatenated sequence. Conversely, if they are not related, the concatenated sequence will contain varying values. Thus, cr_(a,b) quantifies the association between actions a and b. Hence, the overall pairwise association provides a measure of the strength of the relationship between the actions, which can be computed by:

$S_{r\; 3}^{(A)} = {\sum\limits_{{\forall\; a},b}{cr}_{a,b}}$

FIG. 6 is a schematic block diagram of an apparatus according to an example suitable for implementing a system, method or process described above. Apparatus 600 includes one or more processors, such as processor 601 which can be a multi-core processor, providing an execution platform for executing machine readable instructions such as software. Commands and data from the processor 601 are communicated over a communication bus 399. The system 600 also includes a main memory 602, such as a Random Access Memory (RAM), where machine readable instructions may reside during runtime, and a secondary memory 605. The secondary memory 605 includes, for example, a hard disk drive 607 and/or a removable storage drive 630, representing a floppy diskette drive, a magnetic tape drive, a compact disk drive, etc., or a nonvolatile memory where a copy of the machine readable instructions or software may be stored. The secondary memory 605 may also include ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM). In addition to software, data representing any one or more of candidate matches, matches, matching scores, entities, transaction logs (raw or processed), and listings for tuples may be stored in the main memory 602 and/or the secondary memory 605. The removable storage drive 630 reads from and/or writes to a removable storage unit 609 in a well-known manner.

A user can interface with the system 600 using one or more input devices 611, such as a keyboard, a mouse, a stylus, and the like in order to provide user input data or manipulate certain computed results for example. The display adaptor 615 interfaces with the communication bus 399 and the display 617 and receives display data from the processor 601 and converts the display data into display commands for the display 617. A network interface 619 is provided for communicating with other systems and devices via a network (not shown). The system can include a wireless interface 621 for communicating with wireless devices in the wireless community.

It will be apparent to one of ordinary skill in the art that one or more of the components of the system 600 may not be included and/or other components may be added as is known in the art. The system 600 shown in FIG. 6 is provided as an example of a possible platform that may be used, and other types of platforms may be used as is known in the art. One or more of the steps described above may be implemented as instructions embedded on a computer readable medium and executed on the system 600. The steps may be embodied by a computer program, which may exist in a variety of forms both active and inactive. For example, they may exist as software program(s) comprised of program instructions in source code, object code, executable code or other formats for performing some of the steps. Any of the above may be embodied on a computer readable medium, which includes storage devices and signals, in compressed or uncompressed form. Examples of suitable computer readable storage devices include conventional computer system RAM (random access memory), ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), and magnetic or optical disks or tapes. Examples of computer readable signals, whether modulated using a carrier or not, are signals that a computer system hosting or running a computer program may be configured to access, including signals downloaded through the Internet or other networks. Concrete examples of the foregoing include distribution of the programs on a CD ROM or via Internet download. In a sense, the Internet itself, as an abstract entity, is a computer readable medium. The same is true of computer networks in general. It is therefore to be understood that those functions enumerated above may be performed by any electronic device capable of executing the above-described functions.

In an example, modules 603, 605 in memory 602 can include data representing a finite mixture model for an entity and a set of parameters determined using an expectation maximization process. Memory 602 can also include a module comprising data representing a set of candidate matches computed using a DFT as described above.

FIG. 7 is a schematic block diagram of a system according to an example. Apparatus 700 of FIG. 7 is similar to that of FIG. 6. A database 701, which can be a remote database for example, stores data such as entity information and transaction logs. Apparatus 700 is operatively coupled to the database 701, for example by a communication link using the NIC 619 or wireless interface 621, and can receive and send data from and to the database 701. Accordingly, a finite mixture model module 603 can be used to perform matches for the data residing on the database 701, and a computed set of parameters 605 can be used to provide a matching score to a user via the display 617 for example.

FIG. 8 is a block diagram of a method according to an example. As such, in a computer implemented method for matching data records 801 from multiple entities 800, respective transaction logs 803 for the entities 800 representing actions performed by or in respect of the entities 800 are provided. In block 805, a matching score is determined using the transaction logs 803 for respective pairs of the entities and for predetermined combinations of merged entities by generating a measure 807 representing a gain in behavior recognition for the entities 800 before and after merging. In block 809 the gain is used as a matching score.

FIG. 9 is a block diagram according to an example, in which there is a computer program embedded on a non-transitory tangible computer readable storage medium 900, the computer program including machine readable instructions 901 that, when executed by a processor 903, implement a method for matching data records from multiple entities 907 comprising providing respective transaction logs 905 for the entities 907 representing actions performed by or in respect of the entities, determining a matching score 911 using the transaction logs 905 for respective pairs of the entities 907 and for predetermined combinations of merged entities by generating a measure 909 representing a gain in behavior recognition for the entities before and after merging, and using the gain as a matching score.

FIG. 10 is a block diagram of a method according to an example. In block 1007 a coarse match for records 1003 is determined using a value representing a period of occurrence 1000 of certain actions 1005 for an entity including data from a merged pair of sources. For a match above a predetermined threshold 1009, a statistical model 1011 is used to determine a final matching score 1013.

1. A computer implemented method for matching data records from multiple entities comprising: providing respective transaction logs for the entities representing actions performed by or in respect of the entities; determining a matching score using the transaction logs for respective pairs of the entities and for predetermined combinations of merged entities by generating a measure representing a gain in behavior recognition for the entities before and after merging, and using the gain as a matching score.
2. A method as claimed in claim 1, further comprising converting the transaction logs to a predetermined format to provide a processed log including data from the transaction logs and a set of identifiers representing combinations of features for respective actions.
3. A method as claimed in claim 2, further comprising generating a behavior matrix for an entity using the identifiers.
4. A method as claimed in claim 1, wherein determining a matching score includes determining if a merged entity exhibits a consistent behavior compared to a behavior pattern of actions for the individual entities.
5. A method as claimed in claim 1, wherein determining a matching score includes determining a behavior recognition score for a behavior of an entity representing consistency of the entities behavior over multiple components.
6. A method as claimed in claim 5, wherein the multiple components include a measure representing consistency in repeating actions, a measure representing stability in features describing actions, and a measure representing association between actions.
7. A method as claimed in claim 3, further comprising generating a first element of a discrete Fourier transform for a binarised behavior matrix in which non zero values are replaced with the value “1” to provide a complex number representing an action in the behavior matrix.
8. A method as claimed in claim 7, wherein the inverse of the magnitude of the complex number represents a recognition score for an action of an entity.
9. A method as claimed in claim 1, further comprising generating a statistical model for the behavior of an entity using the transaction log for that entity, wherein parameters of the model are used to determine a measure for repeated patterns in a sequence corresponding to action occurrences for an entity.
10. A computer program embedded on a non-transitory tangible computer readable storage medium, the computer program including machine readable instructions that, when executed by a processor, implement a method for matching data records from multiple entities comprising: providing respective transaction logs for the entities representing actions performed by or in respect of the entities; determining a matching score using the transaction logs for respective pairs of the entities and for predetermined combinations of merged entities by generating a measure representing a gain in behavior recognition for the entities before and after merging, and using the gain as a matching score.
11. A computer program embedded on a non-transitory tangible computer readable storage medium as claimed in claim 10, the computer program including machine readable instructions that, when executed by a processor, implement a method for matching data records from multiple entities further comprising converting the transaction logs to a predetermined format to provide a processed log including data from the transaction logs and a set of identifiers representing combinations of features for respective actions.
12. A computer program embedded on a non-transitory tangible computer readable storage medium as claimed in claim 11, the computer program including machine readable instructions that, when executed by a processor, implement a method for matching data records from multiple entities further comprising generating a behavior matrix for an entity using the identifiers.
13. A computer program embedded on a non-transitory tangible computer readable storage medium as claimed in claim 10, the computer program including machine readable instructions that, when executed by a processor, implement a method for matching data records from multiple entities wherein determining a matching score includes determining if a merged entity exhibits a consistent behavior compared to a behavior pattern of actions for the individual entities.
14. A computer program embedded on a non-transitory tangible computer readable storage medium as claimed in claim 10, the computer program including machine readable instructions that, when executed by a processor, implement a method for matching data records from multiple entities wherein determining a matching score includes determining a behavior recognition score for a behavior of an entity representing consistency of the entities behavior over multiple components.
15. A computer program embedded on a non-transitory tangible computer readable storage medium as claimed in claim 14, the computer program including machine readable instructions that, when executed by a processor, implement a method for matching data records from multiple entities wherein the multiple components include a measure representing consistency in repeating actions, a measure representing stability in features describing actions, and a measure representing association between actions.
16. A computer program embedded on a non-transitory tangible computer readable storage medium as claimed in claim 12, the computer program including machine readable instructions that, when executed by a processor, implement a method for matching data records from multiple entities further comprising generating a first element of a discrete Fourier transform for a binarised behavior matrix in which non zero values are replaced with the value “1” to provide a complex number representing an action in the behavior matrix.
17. A computer program embedded on a non-transitory tangible computer readable storage medium as claimed in claim 16, the computer program including machine readable instructions that, when executed by a processor, implement a method for matching data records from multiple entities, wherein the inverse of the magnitude of the complex number represents a recognition score for an action of an entity.
18. A computer program embedded on a non-transitory tangible computer readable storage medium as claimed in claim 10, the computer program including machine readable instructions that, when executed by a processor, implement a method for matching data records from multiple entities further comprising generating a statistical model for the behavior of an entity using the transaction log for that entity, wherein parameters of the model are used to determine a measure for repeated patterns in a sequence corresponding to action occurrences for an entity.
19. A method for matching records from multiple sources, comprising: determining a coarse match for records using a value representing a period of occurrence of certain actions for an entity including data from a merged pair of sources; for a match above a predetermined threshold, using a statistical model to determine a final matching score.